ED 118 590                                              TM 005 074

AUTHOR         Donlon, Thomas F.
TITLE          Establishing Appropriate Time Limits for Tests.
PUB DATE       [Nov 73]
NOTE           22p.; Paper presented at the Annual Meeting of the
               Northeast Educational Research Association
               (Ellenville, New York, November 1973)

EDRS PRICE     MF-$0.83 HC-$1.67 Plus Postage
DESCRIPTORS    *Statistical Analysis; *Test Construction; *Timed Tests



ABSTRACT

     The implications of various time limits for tests of a fixed length
or number of items seem obvious. If the time permitted is much too short,
scores may bunch up at the low end of the potential score range, with a
loss of potential variance and a general diminishing of the utility of the
variance which is observed. If the time permitted is an intermediate value,
the scores tend to become some mixture of power and speed. This paper
proposes a simple technique for estimating the mean and standard deviation
of the distribution of finishing times for a population of test takers.
Given such values, a number of decisions concerning test specifications
can be made. (Author/DSP)



Establishing Appropriate Time Limits for Tests






Thomas F. Donlon
Educational Testing Service
Princeton, N. J.



Presented at the Fall, 1973 Meeting 
Northeast Educational Research Association 



Establishing Appropriate Time Limits for Tests



     The implications of various time limits for tests of a fixed "length"
or number of items seem obvious. If the time permitted is much too short,
scores may bunch up at the low end of the potential score range, with a loss
of potential variance and a general diminishing of the utility of the variance
which is observed. If the time permitted is an intermediate value, the scores
tend to be some mixture of speed and power in the sense of Gulliksen (1950).
Only with the most generous time limits do tests fully become power tests.
Seldom are these generous times a practical possibility, however. Ordinarily
time limits must meet the needs of the test administrator as much as the test
taker. Under these conditions, some people finish, but some do not.

     Such individual differences in the rate of work on objective tests
complicate the meaning of the measures themselves, in the sense that they
introduce factorial complexity. Accordingly, there is a general effort to make
tests predominantly speed or power, and in most cases of measures of
educational attainment the effort is made to create a power test. That is, the
time limits are set so that most subjects complete the work.

     No single technique for characterizing test speededness is widely
established. Educational Testing Service has long focussed on three
characteristics of the completion activity of the population taking the test:
(a) the percent completing the test, (b) the percent completing 75% of the
test, and (c) the test item at which approximately 80% of the total group are
still working. These data are combined as criteria which, according to
Swineford (1956), make a test speeded if (1) fewer than 100% of the candidates
reach 75% of the items, and (2) fewer than 80% of the candidates finish 100%
of the items.
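
The two conditions just quoted can be written down as a small check. The sketch below is only an illustration in Python: the container `CompletionSummary`, the function `swineford_flags`, and the sample percentages are hypothetical names and values, not part of Swineford's manual. It follows the wording above literally, joining the two conditions with "and", and also reports each condition separately so that a reader who prefers a different combination can apply it.

```python
from dataclasses import dataclass


@dataclass
class CompletionSummary:
    """Completion statistics for one administration (hypothetical container)."""
    pct_reaching_75: float  # percent of candidates who reached 75% of the items
    pct_finishing: float    # percent of candidates who reached the last item


def swineford_flags(summary: CompletionSummary) -> dict:
    """Evaluate the two conditions as they are worded in the text."""
    cond_1 = summary.pct_reaching_75 < 100.0  # (1) fewer than 100% reach 75% of items
    cond_2 = summary.pct_finishing < 80.0     # (2) fewer than 80% finish all items
    return {
        "fewer_than_100_reach_75": cond_1,
        "fewer_than_80_finish": cond_2,
        # Assumption: the conditions are combined with "and", as the sentence reads.
        "speeded": cond_1 and cond_2,
    }


# Hypothetical administration: 96% reach 75% of the items, 67% finish.
print(swineford_flags(CompletionSummary(pct_reaching_75=96.0, pct_finishing=67.0)))
```

Reporting the two flags separately keeps the sketch usable even if a program applies the conditions one at a time.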






     Evans and Reilly (1972) used Swineford's criteria for speededness but
introduced a graphic technique which plots the percent of candidates still
working at various points in the test. Their own presentation used a base
line of number of items. A more general approach would simply plot "percent
of subjects still working" as a function of "percent of test worked on".
An example would be



[Figure: percent of subjects still working (vertical axis, marked at 80 and
100) plotted against percent of test completed (horizontal axis, marked at 75
and 100).]
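
The points of such a plot can be tabulated directly from item analysis data. The following Python sketch assumes that, for each subject, the number of the last item reached is available; the function name `still_working_curve` and the ten subjects are hypothetical and serve only to show the bookkeeping.

```python
def still_working_curve(items_reached, n_items):
    """Return (percent of test, percent of subjects still working) pairs.

    items_reached: last item number reached by each subject (hypothetical data).
    A subject counts as still working at item k if he or she reached item k.
    """
    n_subjects = len(items_reached)
    curve = []
    for k in range(1, n_items + 1):
        working = sum(1 for r in items_reached if r >= k)
        curve.append((100.0 * k / n_items, 100.0 * working / n_subjects))
    return curve


# Ten hypothetical subjects on a 20-item test.
reached = [20, 20, 20, 20, 20, 20, 20, 18, 16, 15]
curve = still_working_curve(reached, n_items=20)
print(curve[14])  # the point at item 15, i.e. 75% of the test
print(curve[-1])  # the point at the last item
```

In this invented group everyone is still working at 75% of the test, and 70% reach the last item.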



     Such diagrams will characteristically exhibit the picture of a square
with a "chunk" missing in the upper right hand corner. This is so because
most tests do not show "dropout," in the sense of an appreciable percentage
no longer reaching certain items, until the subjects are well into the test.
Analogously, most tests are completed by most of the subjects. While the
specific patterns of the function will vary, the curve will slope downward
from the top to the right hand vertical axis. By Swineford's criteria, if
this descending line lies within the region bounded by 75% of the test and 80%
of the subjects, it is unspeeded. Departure from this region is a signal of
speed.



     Stafford (1971) has proposed a "speededness quotient," defined as

$$SQ = \frac{U}{U + O + W}$$

     where     U = the number of items Not Reached
               O = the number of items Omitted
               W = the number of items Wrong

     According to Stafford, this index has some advantages over earlier
indices proposed by Cronbach and Warrington (1951) and by Gulliksen (1950).
These earlier suggestions formed ratios of variances of attempted items with
total test scores on wrong answers. Stafford asserts that they were inordinately
difficult to compute and to interpret.

     All of these indices differ from the ETS approach in that they appraise
variance in the number of items completed by considering variance in the
number of errors. This is an important distinction between the two approaches.
The results yielded will differ somewhat, too. Tests unspeeded by Swineford's
criterion could show moderate speed by Stafford's criterion, approaching a
value of about .25 as a limit. The general concern about speed and the
widespread valuing of power, however, has tended to be psychometric, obscuring
some other more practical considerations. Time limits tend to be set as they
are because there are some real world needs. Time limits reflect these needs,
so that a test conforms to a classroom period or to a three hour work session
more from a need to conform to the institutional context than from any
psychometric factors.

     The establishing of time limits introduces some practical problems,
however, because of the range of individual differences in work rates which is
exhibited. Directors of large national programs have reported that in some
tests or sections of tests, where large numbers of candidates work together in
a room, some finish early, creating administrative difficulty for proctors.
While the obvious solution to this problem is to give the candidates more to
do, it is often unclear how much more material should be added, or by how much
the time should be reduced. It is desirable, then, to know something about the
distribution of finishing times (or work rates) for various kinds of tests
and varying composites of time allowance and test length.

     This paper proposes a simple technique for estimating the mean and
standard deviation of the distribution of finishing times for a population
of test takers. Given such values, a number of decisions concerning test
specifications can be made.

     A frequent component of an item analysis is a description of the
"dropout" or failure to complete the test. This "dropout" is the percentage
failing to reach the last item, and it can also be computed for other items.
The graph adapted from Evans and Reilly is based upon the dropout for a test.

     For any subject who fails to complete the test we may estimate the time
which would be required to finish. Thus, if a person completes one-half of the
test in the prescribed amount of time, one can assume that the entire test would
be completed in twice that time. If two-thirds is completed, then half again
as much time is required. The basic assumption is that a person has a basic and
consistent work rate, in the sense of items-per-minute.

     The practical consequences of this assumption are that, for anyone failing
to finish the test, we may estimate the time which would have been needed to
finish. For those who complete the test, of course, we cannot develop an
individual estimate. Although the number of "units" of work they accomplished
is known, the time they worked is not. The only conclusion we may make is that
the individual finished on or before the moment time was called.



     However, if we can assume that the "time needed" (or, conversely, the
work rate) is normally distributed in the group being tested, we can associate
a z-score with each "time-needed" score, by considering the proportion of the
sample who exhibit scores equal to or greater than the one under consideration.

     Consider a 30-item test with a 30-minute time limit. By the logic above,
the following relationships exist



     Items        Not          Items/       Time
     Reached      Reached      Minute       Needed

       30            0          1.00        30.00 min.
       29            1          0.97        30.93  "
       28            2          0.93        32.26  "
       27            3          0.90        33.33  "
       26            4          0.87        34.48  "
       25            5          0.83        36.14  "
        .            .            .            .
        .            .            .            .
       20           10          0.67        45.00  "
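
The extrapolation behind the table can be sketched in a few lines of Python. The function name `time_needed` is hypothetical, and the sketch assumes only the constant work rate described above. Computed without intermediate rounding, the values for 29, 28, 26 and 25 items come out as 31.03, 32.14, 34.62 and 36.00 minutes; the slightly different entries in the table appear to result from dividing 30 by the rate after it was rounded to two decimals (for example 30 / 0.97 = 30.93).

```python
def time_needed(items_reached: int, n_items: int, time_limit: float) -> float:
    """Minutes needed to finish all n_items at the subject's observed rate."""
    rate = items_reached / time_limit  # items per minute, assumed constant
    return n_items / rate


# Reproduce the rows of the table for a 30-item, 30-minute test.
for reached in (30, 29, 28, 27, 26, 25, 20):
    rate = reached / 30.0
    needed = time_needed(reached, n_items=30, time_limit=30.0)
    print(f"{reached:2d}  {30 - reached:2d}  {rate:.2f}  {needed:.2f}")
```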



     Suppose that the test is completed by 67% of the group. This indicates a
z-score of 0.45 for this value: 67% of the group have work rates, expressed as
items per minute, which equal or exceed a rate of 1.00. Suppose further that
those reaching 29 or more of the items constitute 81% of the group. This
indicates a z-score of 0.86, a time-needed score of 30.93.
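
The conversion from a cumulative percentage to a z-score is a lookup in the normal curve, which can also be done with the inverse normal distribution function. The sketch below uses `NormalDist` from the Python standard library; the helper name `z_for_proportion` is hypothetical. The computed values, about 0.44 and 0.88, differ slightly from the 0.45 and 0.86 quoted above, which were presumably read from a printed table.

```python
from statistics import NormalDist


def z_for_proportion(proportion_at_or_below: float) -> float:
    """z-score below which the given proportion of a normal group falls."""
    return NormalDist().inv_cdf(proportion_at_or_below)


print(round(z_for_proportion(0.67), 2))  # proportion finishing the test
print(round(z_for_proportion(0.81), 2))  # proportion reaching 29 or more items
```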

     Under the assumption of normality the information must be consistent for
each of our two points. That is, a raw score (a "time needed" score stated
in minutes) is a linear function of a z-score, a person's position in the normal
distribution, and since the two points are supposedly linearly related, they
should obey the function

$$S = z\sigma + M$$

where $S$ is the observed "time needed" score, $z$ is the normal curve
z-score linked to the percentage with a score equal to or less than $S$,
and $\sigma$ and $M$ are unknown group parameters.

     Continuing the foregoing discussion, the assumptions yield two
simultaneous equations for the data given above

$$30.00 = .45\sigma + M$$

$$30.77 = .86\sigma + M$$

They are solved to give values for $M$ and $\sigma$ of about 29.15 and 1.88
respectively. That is, if the assumptions are correct about a fairly steady work
rate and a normal distribution of these rates, then the knowledge of two points
on the distribution of Not Reached scores enables us to estimate the mean and
standard deviation of the time needed.
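
Solving the pair of equations is a matter of subtracting one from the other. The Python sketch below, with the hypothetical helper `fit_mean_sd`, uses the two equations as they are printed (30.00 at z = .45 and 30.77 at z = .86) and reproduces the stated values of about 29.15 and 1.88. Note that the preceding prose gives 30.93 as the time-needed score for the second point; substituting that figure would give different estimates, so the choice of 30.77 here is an assumption made to match the printed equations and results.

```python
def fit_mean_sd(point_a, point_b):
    """Solve S = z * sigma + M for (M, sigma) from two (z, S) points."""
    (z1, s1), (z2, s2) = point_a, point_b
    sigma = (s2 - s1) / (z2 - z1)  # slope of the line through the two points
    mean = s1 - z1 * sigma         # intercept, i.e. the time needed at z = 0
    return mean, sigma


mean, sigma = fit_mean_sd((0.45, 30.00), (0.86, 30.77))
print(round(mean, 2), round(sigma, 2))  # 29.15 1.88
```

With more than two points, as in the figures discussed later, the same line could be fitted by eye or by least squares rather than solved exactly.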

     With the knowledge that the average person completes the test in 29.15
minutes, and with the information that the standard deviation is equal to 1.88,
one can estimate values of $M \pm 2\sigma$. These would be 32.91 minutes and
25.39 minutes. These data would suggest that the test is reasonably well timed.
At the most, the very fastest workers are finishing about five minutes early;
the very slowest might require some three minutes longer.
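
Once the mean and standard deviation are in hand, the range and the share of early finishers follow from the normal curve. The Python sketch below is an illustration under the paper's normality assumption; the function `finishing_time_summary` and its dictionary keys are hypothetical names.

```python
from statistics import NormalDist


def finishing_time_summary(mean, sd, time_limit, minutes_early):
    """Summarize a normal distribution of time-needed scores (in minutes)."""
    dist = NormalDist(mu=mean, sigma=sd)
    return {
        "fast_end": mean - 2 * sd,  # time needed by the very fastest workers
        "slow_end": mean + 2 * sd,  # time needed by the very slowest workers
        "share_finishing": dist.cdf(time_limit),
        "share_early": dist.cdf(time_limit - minutes_early),
    }


summary = finishing_time_summary(29.15, 1.88, time_limit=30.0, minutes_early=5.0)
print(round(summary["fast_end"], 2), round(summary["slow_end"], 2))
print(round(summary["share_finishing"], 2), round(summary["share_early"], 3))
```

For the hypothetical test, fewer than 2% of the group would be expected to finish five or more minutes early, which agrees with the judgment that it is reasonably well timed.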

     These hypothetical data consider only two points. In fact, an item analysis
may yield a number of such points. Figure 1 shows the relationship between
z-scores, determined by the proportion of the population reaching a certain
level, and the number of items reached. The essential linearity would seem to
confirm the hypothesis of normality. As indicated, these data are for a
[illegible]-item test, administered in 30 minutes. The item material was "data
sufficiency," a mathematical item type. The "items completable" entries show
that on the average approximately [illegible] items would be reached in 30
minutes, and that the standard deviation of item material finished would be
about 7.8.

     These data can be used to find the time needed for 30 items. The mean
is [illegible] minutes, the standard deviation about 5.4 minutes. By this
estimate about 16% of the population are finishing five minutes or more before
time is called, and the very fastest workers, two standard deviations from the
mean, are finishing 10 minutes early. Such a sizable group of early finishers
might well constitute an administrative problem.

     Figure 2 shows the plot for another 30-item, 30-minute pretest of
mathematics material. Again, linearity seems observed. The lines in all figures
in this paper are simply drawn in by hand, but there are few departures from
a set of points which is easily connected.

     Figure 3 shows data for a verbal pretest of 55 items in 20 minutes.
Figure 4 presents a similar chart, but with a marked departure from linearity.
Apparently the last five items in this pretest possessed some unusual
characteristic which demanded more time. Instead of the approximately
[illegible] items which might have been anticipated as an average completion,
only 50% or so of the group finished the test.

     This is an unusual result. Figures 5-10 show the plots for six sections
of two English Composition tests. In each case the utility of the linear
estimate can be defended, although points for values of z above 2.0 tend to
flatten off. This "flattening" may be due to the characteristics of the material
or to the very small numbers of subjects.

     Whether such data have practical value for predicting finishing times
requires an empirical test. In a sense, they are surprising in that they
summarize the experience of quite heterogeneous groups of people, in terms of
ability, with heterogeneous groups of items, in terms of difficulty, but they
would seem to conform rather closely to a model which anticipates a normally
distributed [illegible] work rate.

     The aberrant plot in Figure 4 suggests the complexity of the "test
speededness" concept. It would appear that the last five items were found
unusually time-consuming by the subjects, to a degree which ran counter to the
average of such items. Is the test speeded? In some ways the introduction of
difficult, time-consuming later material may assist in the creation of orderly
examination rooms. It may also affect the distribution of scores. It would be
interesting to consider skewness data as they relate to speededness data of
this type.

     If the basic diagram of Reilly and Jackson is reconstituted so that the
ordinate, or percent reaching axis, is redefined as a z-score axis, the plots of
such data should appear approximately as follows:

[Figure: z-score (vertical axis, marked at 0 and 1) plotted against a
horizontal axis marked at 50 and 100.]

     While this diagram could be further modified to redefine the base line
more clearly as a rate variable, the essential focus is established.






[...] extension, indicated by the dotted line. Theoretically the slope
of the line is determined by the collection of material which is incorporated
into the test. A test which has a flat slope is more likely to be speeded,
in some sense, than another. The task of the test constructor is to develop
tests which have steep slopes. Swineford's criteria can now be restated in
the new formulation: a test is unspeeded if the equation of its "dropout"
line, as here defined, is

[equation not legible in the source]

     where X = the percent of the test completed

or a line with steeper slope. The extension of this work to considerations
posed by Stafford is clearly suggested.



Cronbach, L. J., & Warrington, W. G. Time-limit tests: Estimating their
     reliability and degree of speeding. Psychometrika, 1951, 16, 167-188.

Evans, F. R., & Reilly, R. R. A study of speededness as a source of test bias.
     Journal of Educational Measurement, 1972, 9(2), 123-131.

Gulliksen, H. Theory of mental tests. New York: John Wiley and Sons, 1950.

Stafford, R. E. The speededness quotient: A new descriptive statistic for
     tests. Journal of Educational Measurement, 1971, 8(4).

Swineford, F. Technical manual for users of test analyses. Statistical Report
     56-42. Princeton, N.J.: Educational Testing Service, 1956.







